Support storing precision of decimal types in Schema class #17176
Signed-off-by: Nghia Truong <[email protected]>
precision variable for DType class in DType.java
precision of decimal type in DType and Schema classes
Signed-off-by: Nghia Truong <[email protected]>
I don't think that this is going to be good from a design standpoint. I also don't think that this solves the issue that you are complaining about.
CUDF does not store precision with their decimal type, so if we round trip the type to CUDF and back (say, in a LIST of DECIMALs) the precision will be lost. That is totally unexpected for a user. CUDF also will not enforce this precision in any way, or pass it on when doing computation. This precision is just metadata that is going to be thrown away/ignored by CUDF. This violates the principle of least surprise.
We also have ways to include precision for the few places that CUDF uses it (writing Parquet/ORC).
private int precision;
I don't see any value in doing this unless CUDF is going to truly support precision.
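The round-trip concern raised above can be sketched in a few lines. This is only an illustration under stated assumptions: `RoundTripSketch`, `JavaDecimal`, `NativeDecimal`, and `UNKNOWN_PRECISION` are hypothetical names invented here, not cudf classes, and the native side is modeled as storing only a scale, as the comment describes.

```java
// Hypothetical stand-ins for illustration; these are NOT the cudf Java classes.
final class RoundTripSketch {
    // Sentinel for "no precision recorded". 0 is not used as the sentinel
    // because a precision of 0 can itself be a valid value.
    static final int UNKNOWN_PRECISION = -1;

    // Java-side decimal type that carries precision as extra metadata.
    static final class JavaDecimal {
        final int scale;
        final int precision;

        JavaDecimal(int scale, int precision) {
            this.scale = scale;
            this.precision = precision;
        }
    }

    // Native-side decimal type (assumed): only the scale is stored.
    static final class NativeDecimal {
        final int scale;

        NativeDecimal(int scale) {
            this.scale = scale;
        }
    }

    // Going to the native side drops the precision: there is no field for it.
    static NativeDecimal toNative(JavaDecimal type) {
        return new NativeDecimal(type.scale);
    }

    // Coming back, the precision cannot be recovered from the scale alone.
    static JavaDecimal fromNative(NativeDecimal type) {
        return new JavaDecimal(type.scale, UNKNOWN_PRECISION);
    }

    public static void main(String[] args) {
        JavaDecimal before = new JavaDecimal(-2, 10);
        JavaDecimal after = fromNative(toNative(before));
        System.out.println("precision before: " + before.precision);
        System.out.println("precision after:  " + after.precision);
    }
}
```

The scale survives the trip while the precision silently falls back to the sentinel, which is the kind of surprise the review is pointing at.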
We need to convert a Spark schema into a cudf schema. When reading JSON, we also need to convert strings to decimals using the precision from Spark.
The issue I have with this is that it violates the principle of least surprise. I get that there are use cases where the code will be simpler/cleaner if we can put the precision in with the DType. It would be a lot cleaner if we could have the precision be in the DType when we want to write a Parquet or ORC file. But, in my opinion, those benefits don't outweigh the harm caused by someone expecting the precision to be properly reflected everywhere when in reality it is not. For example when
Also, technically a precision of 0 is valid (at least in Spark). It can only ever hold the value 0 or null, so it is close to useless. But it is valid.
Alright, then I will close this to avoid producing more "surprise". Thanks Bobby.
This reverts commit 76ab5fb.
Signed-off-by: Nghia Truong <[email protected]>
Reopen as this can be implemented with only changes in
Signed-off-by: Nghia Truong <[email protected]>
f830e7a to 771150c
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
This is still fundamentally the same issue as before. There are no APIs in CUDF that take a Schema which will use the precision. Schema is used by This is better because the schema here is not going to be used to round trip information to CUDF and back. But it is still fundamentally broken. We are making a change to CUDF for something that CUDF just does not and probably never will support. It is here so that some other library, spark-rapids-jni, can provide a simpler API for functionality that goes beyond what CUDF supports. I am not going to fight this any more. This does not break things too horribly. But at a minimum we have to document that precision is completely and totally ignored if it is set.
Signed-off-by: Nghia Truong <[email protected]>
Thanks Bobby. Yes, I understand that this is not a good design, but in the meantime we do not seem to have a better solution. The only workaround I can think of is to keep a separate flattened array of precisions for all columns along with the nested schema, but that is more error prone. Update: I've added the docs, clearly saying that we add
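For reference, the "separate flattened array" workaround mentioned in this reply might look roughly like the sketch below. Everything here is an assumption made for illustration: `FlattenedPrecisionSketch`, `Node`, and `assign` are hypothetical names, and the depth-first ordering of the flat array is just one possible convention, not something the thread specifies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the workaround; none of these types exist in cudf.
final class FlattenedPrecisionSketch {
    // Minimal nested schema node: a name, a decimal flag, and child columns.
    static final class Node {
        final String name;
        final boolean isDecimal;
        final List<Node> children;

        Node(String name, boolean isDecimal, Node... children) {
            this.name = name;
            this.isDecimal = isDecimal;
            this.children = Arrays.asList(children);
        }
    }

    // Pairs each decimal column with a precision by walking the schema
    // depth-first and consuming the flat array in that same order.
    static List<String> assign(Node root, int[] precisions) {
        List<String> out = new ArrayList<>();
        int used = walk(root, precisions, 0, out);
        if (used != precisions.length) {
            throw new IllegalArgumentException("too many precisions supplied");
        }
        return out;
    }

    // Returns the index of the next unused entry in the flat array.
    private static int walk(Node node, int[] precisions, int next, List<String> out) {
        if (node.isDecimal) {
            if (next >= precisions.length) {
                throw new IllegalArgumentException("too few precisions supplied");
            }
            out.add(node.name + "=" + precisions[next]);
            next++;
        }
        for (Node child : node.children) {
            next = walk(child, precisions, next, out);
        }
        return next;
    }
}
```

The pairing depends entirely on both sides agreeing on traversal order and element count. A length mismatch is caught here, but a reordered column with the same count would be mispaired silently, which is one way the approach is error prone.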
I still don't like it, but like I said before I am done fighting it.
/merge
In Spark, the DecimalType has a specific number of digits to represent the numbers. However, when creating a data Schema, only type and name of the column are stored, thus we lose that precision information. As such, it would be difficult to reconstruct the original decimal types from cudf's Schema instance.

This PR adds a precision member variable to the Schema class in cudf Java, allowing it to store the precision number of the original decimal column.

Partially contributes to NVIDIA/spark-rapids#11560.
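As a rough illustration of the idea, a schema that records precision alongside each column's name and type could be sketched as follows. This is not the cudf `Schema` API: `SchemaSketch`, `addColumn`, `precisionOf`, and `UNKNOWN_PRECISION` are hypothetical names chosen for this example, and types are represented as plain strings to keep it self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; this is NOT the cudf Schema class or its builder.
final class SchemaSketch {
    // Sentinel for columns with no recorded precision. A negative value is
    // used because a precision of 0 can be valid.
    static final int UNKNOWN_PRECISION = -1;

    // One column entry: name, type name, and optional precision metadata.
    static final class Column {
        final String name;
        final String typeName;
        final int precision;

        Column(String name, String typeName, int precision) {
            this.name = name;
            this.typeName = typeName;
            this.precision = precision;
        }
    }

    private final List<Column> columns = new ArrayList<>();

    // Non-decimal columns: no precision is recorded.
    SchemaSketch addColumn(String name, String typeName) {
        return addColumn(name, typeName, UNKNOWN_PRECISION);
    }

    // Decimal columns: precision is stored purely as metadata for the caller.
    SchemaSketch addColumn(String name, String typeName, int precision) {
        columns.add(new Column(name, typeName, precision));
        return this;
    }

    // Looks up the recorded precision of a column by name.
    int precisionOf(String name) {
        for (Column column : columns) {
            if (column.name.equals(name)) {
                return column.precision;
            }
        }
        throw new IllegalArgumentException("no such column: " + name);
    }
}
```

A caller could build `new SchemaSketch().addColumn("id", "INT32").addColumn("price", "DECIMAL64", 18)` and later read the precision back with `precisionOf("price")`. In line with the review discussion, the stored value is only carried along for the consumer of the schema; nothing in this sketch enforces or validates it.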